Visualization


In this notebook, we visualize the distribution of classes in the Kaggle Right Whale Recognition Challenge data set. In a later notebook, we will learn how to augment the data set by applying affine transforms to the images, but for now we will stick to features computed from the data set provided by Kaggle.


In [1]:
%matplotlib inline 
# the above call allows us to display the seaborn plots within the IPython notebook

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_csv("/Users/.../Machine Learning Competitions/Kaggle/Right Whale Recognition Challenge/features/rgbHistogramsTrainSet8Bins.csv", sep = ",");

As the plot below shows, the images in the Kaggle data set are far from evenly distributed across classes. Many classes have fewer than ten observations while, at the other extreme, a couple of classes have more than forty.


In [4]:
plot = sns.countplot(x="WhaleID", data=data, palette = "Blues_d");
# Hide the x-axis tick labels, since there are too many classes to label legibly
plot.xaxis.set_major_formatter(plt.NullFormatter())


Now let's see how many observations we have in total.


In [5]:
num_obs = data.shape[0]
print num_obs


4544

That's quite a bit of data to work with. Now, let's do a bit more analysis on the distribution using the pandas value_counts method.


In [6]:
# Make a new histogram of classes
histogram = data["WhaleID"].value_counts()

In [7]:
# Turn it into a dictionary for later use
# The dictionary is in the form {"Whale_ID" : num_observations}
histogram_dict = histogram.to_dict()
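
To make the shape of these two objects concrete, here is a minimal sketch on a made-up column. The labels `whale_a`, `whale_b`, and `whale_c` and the `toy_` names are hypothetical stand-ins, not values from the Kaggle data.

```python
import pandas as pd

# Hypothetical toy labels standing in for the real "WhaleID" column.
toy = pd.DataFrame(
    {"WhaleID": ["whale_a", "whale_b", "whale_a", "whale_c", "whale_a", "whale_b"]}
)

# value_counts returns a Series indexed by class label,
# sorted so that the most frequent class comes first.
toy_histogram = toy["WhaleID"].value_counts()

# to_dict turns that Series into {class label: number of observations}.
toy_histogram_dict = toy_histogram.to_dict()

print(toy_histogram)
print(toy_histogram_dict)
```

Because the Series is sorted by count, its first index entry is always the largest class, which is handy when inspecting the most common whales.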

How about we plot all the classes that have more than 20 examples?


In [8]:
# This code looks a little complicated, so let's break it down.

# First we are using a map expression to 'map' each row index of our data 
# frame into a boolean value that tells us whether we want to include that
# row of our data frame for the indices variable.

# The first argument to the map method is a function on the indices.
# The second argument to the map is the list of our data frame indices.

# The function looks at a row of the data frame given by a particular 
# index, accesses its "WhaleID" value, looks that value up in the 
# histogram_dict we created earlier to get the number of observations 
# belonging to that class, then returns True or False depending on 
# whether that count is greater than 20.

# Link: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
# Link: http://www.python-course.eu/lambda.php

indices = map(lambda x: histogram_dict[data.ix[x,]["WhaleID"]] > 20, range(num_obs))
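
The row-by-row lookup above depends on Python 2's `map` returning a list and on the `.ix` indexer, which was deprecated in pandas 0.20 and removed in 1.0. As a possible alternative, the same boolean mask can be built without an explicit loop. The sketch below uses a made-up frame and a threshold of 1 in place of 20; all `toy_` names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real feature table.
toy = pd.DataFrame({
    "WhaleID": ["whale_a", "whale_b", "whale_a", "whale_c", "whale_a", "whale_b"],
    "V1": [0, 1, 2, 3, 4, 5],
})
toy_threshold = 1  # plays the role of the 20 used above

# Number of observations per class, indexed by class label.
toy_counts = toy["WhaleID"].value_counts()

# Series.map looks up each row's label in toy_counts,
# giving the size of the class that row belongs to.
toy_class_size = toy["WhaleID"].map(toy_counts)

# Boolean mask with one entry per row, usable directly for indexing.
toy_mask = toy_class_size > toy_threshold
toy_frequent = toy[toy_mask]

print(toy_frequent)
```

The mask is a boolean Series aligned with the data frame's index, so it can be passed to `data[...]` in the same way as the `indices` list.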

In [9]:
plot = sns.countplot(x = "WhaleID", data = data[indices], palette = "Greens_d")

# The code below fails because set_xticklabels requires the labels as an argument
# Link: http://stackoverflow.com/questions/26540035/rotate-label-text-in-seaborn-factorplot
# plot.set_xticklabels(rotation=90)

# So use this label adjustment code instead
# Link: http://stackoverflow.com/questions/31859285/rotate-tick-labels-for-seaborn-barplot
for item in plot.get_xticklabels():
    item.set_rotation(80)


If we're too lazy to count how many of these classes there are, we could just do it this way:


In [16]:
# Link: http://stackoverflow.com/questions/12765833/counting-the-number-of-true-booleans-in-a-python-list
print sum(histogram > 20)


29
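
The one-liner works because comparing a Series to a number yields a boolean Series, and summing booleans counts the `True` entries. A small sketch with made-up counts (the `toy_` names and values are hypothetical):

```python
import pandas as pd

# Hypothetical class counts standing in for the real histogram Series.
toy_histogram = pd.Series({"whale_a": 3, "whale_b": 2, "whale_c": 1})
toy_threshold = 1  # plays the role of 20

# Boolean Series with one entry per class.
above = toy_histogram > toy_threshold

# True counts as 1 and False as 0, so the sum is the number of classes.
num_above = int(above.sum())

# Negating the mask gives the complementary group of classes.
num_at_or_below = int((~above).sum())

print(num_above)
print(num_at_or_below)
```

The two counts always add up to the total number of classes, which is a quick sanity check when splitting the data by class size.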

Let's plot all the classes with 20 or fewer observations.


In [11]:
indices = map(lambda x: histogram_dict[data.ix[x,]["WhaleID"]] <= 20, range(num_obs))
plot = sns.countplot(x = "WhaleID", data = data[indices], palette = "Purples_d")
plot.xaxis.set_major_formatter(plt.NullFormatter())


Data Selection


Even though we have already seen some advanced data indexing in this notebook, it's worth taking the time to do a little extra indexing work so that we can reuse it in future code. This way, we can make sure that our data selection methods are as clean and fast as possible.

To start out, what if we wanted to select the rows of the data frame that correspond to an arbitrary set of classes, say, whale_38681 and whale_95370? (By the way, bonus points if you can figure out why I selected these two.) The way we do this is simple.


In [17]:
# Link: http://stackoverflow.com/questions/7571635/fastest-way-to-check-if-a-value-exist-in-a-list
two_whales_data = data[map(lambda x: data.ix[x,]['WhaleID'] in ['whale_38681', 'whale_95370'], range(num_obs))];

Now we can print out our data.


In [18]:
print two_whales_data


       V1   V2  V3  V4  V5  V6  V7  V8  V9  V10     ...       V504  V505  \
58      0    0   0   0   0   0   0   0   0    0     ...        561     0   
119     0    0   0   0   0   0   0   0   0    0     ...        185     0   
155     0    0   0   0   0   0   0   0   0    0     ...          3     0   
197     0    0   0   0   0   0   0   0   0    0     ...          0     0   
293     0    0   0   0   0   0   0   0   0    0     ...         88     0   
301     0    0   0   0   0   0   0   0   0    0     ...          4     0   
302     0    0   0   0   0   0   0   0   0    0     ...          0     0   
306     0    0   0   0   0   0   0   0   0    0     ...        447     0   
377    66  142   0   0   0   0   0   0   0   81     ...        300     0   
505     0    0   0   0   0   0   0   0   0    0     ...         36     0   
517     0    0   0   0   0   0   0   0   0    0     ...          0     0   
529     0    0   0   0   0   0   0   0   0    0     ...          0     0   
539     0    0   0   0   0   0   0   0   0    0     ...         76     0   
561     0    0   0   0   0   0   0   0   0    0     ...        460     0   
570     0    0   0   0   0   0   0   0   0    0     ...          0     0   
611     0    0   0   0   0   0   0   0   0    0     ...         15     0   
620     0    0   0   0   0   0   0   0   0    0     ...          0     0   
712     0    0   0   0   0   0   0   0   0    0     ...          4     0   
764     0    0   0   0   0   0   0   0   0    0     ...          0     0   
835     0    0   0   0   0   0   0   0   0    0     ...         72     0   
944     0    0   0   0   0   0   0   0   0    0     ...          6     0   
993     0    0   0   0   0   0   0   0   0    0     ...        230     0   
1036    0    0   0   0   0   0   0   0   0    0     ...        123     0   
1054    0    0   0   0   0   0   0   0   0    0     ...        759     0   
1066    0    0   0   0   0   0   0   0   0    0     ...        145     0   
1072    0    0   0   0   0   0   0   0   0    0     ...          0     0   
1124    0    0   0   0   0   0   0   0   0    0     ...        284     0   
1153    0    0   0   0   0   0   0   0   0    0     ...          0     0   
1159    0    0   0   0   0   0   0   0   0    0     ...       1010     0   
1205    0    0   0   0   0   0   0   0   0    0     ...        329     0   
...   ...  ...  ..  ..  ..  ..  ..  ..  ..  ...     ...        ...   ...   
2742    0    0   0   0   0   0   0   0   0    0     ...       1234     0   
2885    0    0   0   0   0   0   0   0   0    0     ...        153     0   
2895    0    0   0   0   0   0   0   0   0    1     ...        414     0   
3168    0    0   0   0   0   0   0   0   0    0     ...         43     0   
3180    0    0   0   0   0   0   0   0   0    0     ...          0     0   
3186    0    0   0   0   0   0   0   0   0    0     ...        234     0   
3212    0    0   0   0   0   0   0   0   0    0     ...        202     0   
3229    0    0   0   0   0   0   0   0   0    0     ...        102     0   
3255    0    0   0   0   0   0   0   0   0    0     ...          6     0   
3289    0    0   0   0   0   0   0   0   0    0     ...          0     0   
3354    0    0   0   0   0   0   0   0   0    0     ...         17     0   
3469    0    0   0   0   0   0   0   0   0    0     ...        236     0   
3495    0    0   0   0   0   0   0   0   0    0     ...        281     0   
3503    0    0   0   0   0   0   0   0   0    0     ...          0     0   
3613    0    0   0   0   0   0   0   0   0    0     ...        573     0   
3624    0    0   0   0   0   0   0   0   0    0     ...          3     0   
3664    0    0   0   0   0   0   0   0   0    0     ...        187     0   
3678    0    0   0   0   0   0   0   0   0    0     ...        320     0   
3870    2    6   0   0   0   0   0   0   0    9     ...         24     0   
3912    0    0   0   0   0   0   0   0   0    0     ...         94     0   
4107    0    0   0   0   0   0   0   0   0    0     ...        734     0   
4119    0    0   0   0   0   0   0   0   0    0     ...        276     0   
4154    0    0   0   0   0   0   0   0   0    0     ...         10     0   
4227    0    0   0   0   0   0   0   0   0    0     ...          0     0   
4232    0    0   0   0   0   0   0   0   0    0     ...          3     0   
4300    0    0   0   0   0   0   0   0   0    0     ...          3     0   
4343  102    0   0   0   0   0   0   0   0    0     ...         72     0   
4408    0    0   0   0   0   0   0   0   0    0     ...        282     0   
4439    0    0   0   0   0   0   0   0   0    0     ...        201     0   
4496    0    0   0   0   0   0   0   0   0    0     ...          4     0   

      V506  V507  V508  V509  V510  V511   V512      WhaleID  
58       0     0     0     0     0    33   7684  whale_38681  
119      0     0     0     0     0   440  11387  whale_38681  
155      0     0     0     0     0    35    438  whale_95370  
197      0     0     0     0     0     0      1  whale_95370  
293      0     0     0     0     0     2    895  whale_95370  
301      0     0     0     0     0     4     12  whale_95370  
302      0     0     0     0     1   314    217  whale_38681  
306      0     0     0     0     0    90  12049  whale_95370  
377      0     0     0     0     0   153   5570  whale_38681  
505      0     0     0     0     0   604   5887  whale_95370  
517      0     0     0     0     0     3    195  whale_95370  
529      0     0     0     0     0     0      0  whale_38681  
539      0     0     0     0     0   332   7353  whale_95370  
561      0     0     0     0     0  1910  21290  whale_38681  
570      0     0     0     0     0     0      1  whale_95370  
611      0     0     0     0     0   885  10387  whale_38681  
620      0     0     0     0     0     0      2  whale_95370  
712      0     0     0     0     3   636   2163  whale_38681  
764      0     0     0     0     0    92      4  whale_95370  
835      0     0     0     0     0    61   1499  whale_38681  
944      0     0     0     0     0     0      7  whale_95370  
993      0     0     0     0     0   270  10715  whale_38681  
1036     0     0     0     0     0   485   6230  whale_38681  
1054     0     0     0     0     0    52  34854  whale_95370  
1066     0     0     0     0     0   129   7310  whale_95370  
1072     0     0     0     0     0     0    187  whale_95370  
1124     0     0     0     0     0   135  10082  whale_95370  
1153     0     0     0     0     0     3    368  whale_95370  
1159     0     0     0     0     0    32  10813  whale_38681  
1205     0     0     0     0     0   528   1418  whale_95370  
...    ...   ...   ...   ...   ...   ...    ...          ...  
2742     0     0     0     0     0    22  10522  whale_38681  
2885     0     0     0     0     0   135   1489  whale_38681  
2895     0     0     0     0     0   386   2963  whale_95370  
3168     0     0     0     0     0    71   2583  whale_95370  
3180     0     0     0     0     0     1      1  whale_38681  
3186     0     0     0     0     0    83   4797  whale_38681  
3212     0     0     0     0     0  1696  23936  whale_38681  
3229     0     0     0     0     0   269   3591  whale_95370  
3255     0     0     0     0     0    66    190  whale_95370  
3289     0     0     0     0     0     0      0  whale_95370  
3354     0     0     0     0     0     0     24  whale_95370  
3469     0     0     0     0     0   243   1382  whale_95370  
3495     0     0     0     0     0    13    555  whale_38681  
3503     0     0     0     0     0     0      1  whale_38681  
3613     0     0     0     0     0    73  21397  whale_38681  
3624     0     0     0     0     0   146    790  whale_95370  
3664     0     0     0     0     0    37   6281  whale_95370  
3678     0     0     0     0     0   377  15149  whale_38681  
3870     0     0     0     0     0    50    117  whale_38681  
3912     0     0     0     0     0  1274  18597  whale_38681  
4107     0     0     0     0     0     6  11478  whale_38681  
4119     0     0     0     0     0  1746  27751  whale_95370  
4154     0     0     0     0     3   424   2406  whale_95370  
4227     0     0     0     0     0     0      0  whale_95370  
4232     0     0     0     0     0    20     21  whale_95370  
4300     0     0     0     0     0     4     32  whale_38681  
4343     0     0     0     0     0     0    183  whale_95370  
4408     0     0     0     0     0    51   4025  whale_95370  
4439     0     0     0     0     0     7   1857  whale_38681  
4496     0     0     0     0     0     6     25  whale_38681  

[90 rows x 513 columns]
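
For membership tests like this one, pandas also offers `Series.isin`, which builds the boolean mask in a single call. The following is a sketch on a made-up frame, with hypothetical labels and `toy_` names, rather than the code used to produce the output above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real feature table.
toy = pd.DataFrame({
    "WhaleID": ["whale_a", "whale_b", "whale_a", "whale_c", "whale_a", "whale_b"],
    "V1": [0, 1, 2, 3, 4, 5],
})

# The classes we want to keep.
toy_wanted = ["whale_a", "whale_c"]

# isin returns True for every row whose label is in toy_wanted.
toy_subset = toy[toy["WhaleID"].isin(toy_wanted)]

print(toy_subset)
```

The selected rows keep their original index labels, just as in the printed output above.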

What if we wanted to use our previously created histogram variable to return a list of data frames grouped by class? This is easy, too.


In [60]:
def return_data_frames_by_class(data_frame, class_list, y_column_name):
    # Build one data frame per class, in the same order as class_list
    df_list = []
    num_obs = len(data_frame.index)

    for class_name in class_list:
        # Boolean list marking the rows that belong to class_name
        bools = map(lambda x: is_row_part_of_class(x, data_frame, class_name, y_column_name), range(num_obs))
        df_list.append(data_frame[bools])

    return df_list

def is_row_part_of_class(data_frame_index, data_frame, class_name, y_column_name):
    # True if the row at data_frame_index has class_name in its y_column_name column
    return data_frame.ix[data_frame_index,][y_column_name] == class_name

In [61]:
class_list = histogram.axes[0].tolist()

df_list = return_data_frames_by_class(data, class_list, 'WhaleID')

Let's see what the first element of our df_list looks like.


In [63]:
print df_list[0]


       V1  V2  V3  V4  V5  V6  V7  V8  V9  V10     ...       V504  V505  V506  \
155     0   0   0   0   0   0   0   0   0    0     ...          3     0     0   
197     0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
293     0   0   0   0   0   0   0   0   0    0     ...         88     0     0   
301     0   0   0   0   0   0   0   0   0    0     ...          4     0     0   
306     0   0   0   0   0   0   0   0   0    0     ...        447     0     0   
505     0   0   0   0   0   0   0   0   0    0     ...         36     0     0   
517     0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
539     0   0   0   0   0   0   0   0   0    0     ...         76     0     0   
570     0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
620     0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
764     0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
944     0   0   0   0   0   0   0   0   0    0     ...          6     0     0   
1054    0   0   0   0   0   0   0   0   0    0     ...        759     0     0   
1066    0   0   0   0   0   0   0   0   0    0     ...        145     0     0   
1072    0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
1124    0   0   0   0   0   0   0   0   0    0     ...        284     0     0   
1153    0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
1205    0   0   0   0   0   0   0   0   0    0     ...        329     0     0   
1478    0   0   0   0   0   0   0   0   0    0     ...          3     0     0   
1552    0   0   0   0   0   0   0   0   0    0     ...          1     0     0   
1746    0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
1894    0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
1993    0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
2048    0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
2063    0   0   0   0   0   0   0   0   0    0     ...        106     0     0   
2125    0   0   0   0   0   0   0   0   0    0     ...          5     0     0   
2306    0   1   0   0   0   0   0   0   0    0     ...         98     0     0   
2377    0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
2559    0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
2577    0   0   0   0   0   0   0   0   0    1     ...          5     0     0   
2632    0   0   0   0   0   0   0   0   0    0     ...        321     0     0   
2669    0   0   0   0   0   0   0   0   0    0     ...         12     0     0   
2895    0   0   0   0   0   0   0   0   0    1     ...        414     0     0   
3168    0   0   0   0   0   0   0   0   0    0     ...         43     0     0   
3229    0   0   0   0   0   0   0   0   0    0     ...        102     0     0   
3255    0   0   0   0   0   0   0   0   0    0     ...          6     0     0   
3289    0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
3354    0   0   0   0   0   0   0   0   0    0     ...         17     0     0   
3469    0   0   0   0   0   0   0   0   0    0     ...        236     0     0   
3624    0   0   0   0   0   0   0   0   0    0     ...          3     0     0   
3664    0   0   0   0   0   0   0   0   0    0     ...        187     0     0   
4119    0   0   0   0   0   0   0   0   0    0     ...        276     0     0   
4154    0   0   0   0   0   0   0   0   0    0     ...         10     0     0   
4227    0   0   0   0   0   0   0   0   0    0     ...          0     0     0   
4232    0   0   0   0   0   0   0   0   0    0     ...          3     0     0   
4343  102   0   0   0   0   0   0   0   0    0     ...         72     0     0   
4408    0   0   0   0   0   0   0   0   0    0     ...        282     0     0   

      V507  V508  V509  V510  V511   V512      WhaleID  
155      0     0     0     0    35    438  whale_95370  
197      0     0     0     0     0      1  whale_95370  
293      0     0     0     0     2    895  whale_95370  
301      0     0     0     0     4     12  whale_95370  
306      0     0     0     0    90  12049  whale_95370  
505      0     0     0     0   604   5887  whale_95370  
517      0     0     0     0     3    195  whale_95370  
539      0     0     0     0   332   7353  whale_95370  
570      0     0     0     0     0      1  whale_95370  
620      0     0     0     0     0      2  whale_95370  
764      0     0     0     0    92      4  whale_95370  
944      0     0     0     0     0      7  whale_95370  
1054     0     0     0     0    52  34854  whale_95370  
1066     0     0     0     0   129   7310  whale_95370  
1072     0     0     0     0     0    187  whale_95370  
1124     0     0     0     0   135  10082  whale_95370  
1153     0     0     0     0     3    368  whale_95370  
1205     0     0     0     0   528   1418  whale_95370  
1478     0     0     0     0    61    234  whale_95370  
1552     0     0     0     0     0    246  whale_95370  
1746     0     0     0     0     1     74  whale_95370  
1894     0     0     0     0     1     79  whale_95370  
1993     0     0     0     0     0     27  whale_95370  
2048     0     0     0     0     5    218  whale_95370  
2063     0     0     0     0   241   7072  whale_95370  
2125     0     0     0     0    44     56  whale_95370  
2306     0     0     0     0   802   1248  whale_95370  
2377     0     0     0     0    10     56  whale_95370  
2559     0     0     0     0     2    613  whale_95370  
2577     0     0     0     0    74   1075  whale_95370  
2632     0     0     0     0    49   3130  whale_95370  
2669     0     0     0     0    88    133  whale_95370  
2895     0     0     0     0   386   2963  whale_95370  
3168     0     0     0     0    71   2583  whale_95370  
3229     0     0     0     0   269   3591  whale_95370  
3255     0     0     0     0    66    190  whale_95370  
3289     0     0     0     0     0      0  whale_95370  
3354     0     0     0     0     0     24  whale_95370  
3469     0     0     0     0   243   1382  whale_95370  
3624     0     0     0     0   146    790  whale_95370  
3664     0     0     0     0    37   6281  whale_95370  
4119     0     0     0     0  1746  27751  whale_95370  
4154     0     0     0     3   424   2406  whale_95370  
4227     0     0     0     0     0      0  whale_95370  
4232     0     0     0     0    20     21  whale_95370  
4343     0     0     0     0     0    183  whale_95370  
4408     0     0     0     0    51   4025  whale_95370  

[47 rows x 513 columns]

It prints out just what we expected. Nice!
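
The helper above scans every row once per class, which can get slow as the number of classes grows. One possible alternative is to let `groupby` partition the rows in a single pass and then pull the groups out in the desired order. This is a sketch on made-up data with hypothetical `toy_` names, not a drop-in replacement tested against the Kaggle features.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real feature table.
toy = pd.DataFrame({
    "WhaleID": ["whale_a", "whale_b", "whale_a", "whale_c", "whale_a", "whale_b"],
    "V1": [0, 1, 2, 3, 4, 5],
})

# Class labels ordered from most to least frequent, as in class_list above.
toy_class_order = toy["WhaleID"].value_counts().index.tolist()

# groupby partitions the rows by label in one pass over the data.
toy_groups = toy.groupby("WhaleID")

# get_group returns the sub-frame for one label, keeping the original row index.
toy_df_list = [toy_groups.get_group(name) for name in toy_class_order]

print(toy_df_list[0])
```

As with `df_list`, the first element holds the rows of the most frequent class.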

Well, that's it for this notebook. In our next notebook, we will create a very non-statistical sampling method as a first attempt at increasing the size of our data set and improving the distribution of observations across classes. Thanks for reading!